Retail in Municipalities:

Recommendation for opening a business through the integration of time series and neural network based on IBGE data.

Click below to select which one you want to view.

Abstract

This study, using data from IBGE (Brazilian Institute of Geography and Statistics) from 1991, 2000 and 2010, sought to create metrics which allowed suggestions for opening a new retail enterprise in promising municipalities. The approach was based on the premise that certain factors are decisive in the success of a retail business. While others simply have little or no relevance. The study carried out did not seek to prove the previous statement, but to imagine a scenario in which this premise is real and allows suggesting a location for opening a new business. Without external criteria that make it possible to base the “success” of the establishment, it was established that the metrics of population and per capita income are decisive. Through the integration of time series (ARIMA) and neural network (Keras), models that start from different assumptions. Implying, in a certain way, an “agreement” between what was predicted by the models. While ARIMA projected data for the year 2020, Keras applied the model learned in the years 1991, 2000 and 2010 to the data projected by ARIMA. Finally, “concordant” data between the models was selected and the three best cities were selected based on filters and two investor profiles. One profile seeks a high profit margin at the expense of sales volume, the other seeks a high sales volume at the expense of profit margin.

Introduction

Imagining a scenario in which a retailer is interested in opening a new business in Brazilian municipalities. And using IBGE data from 1991, 2000 and 2010. Which municipality suggestion would be possible? Why?

This is a primary study and its scope is limited to IBGE data. In this study, we sought to make predictions for the year 2020 (ARIMA) and “validate” them with Keras.

Objectives

Its general goals were:

  • Demonstrate the author’s repertoire regarding the contents covered here.
  • Develop metrics that allow suggestions for opening a retail venture in a municipality.

Its specific aim to:

  • Project data to 2020 with ARIMA.
  • Train Keras in 1991, 2000 and 2010.
  • Apply models learned by Keras in 2020.
  • Analyze agreement between Keras and ARIMA.
  • Create filters for selecting municipalities.

Methodology

ARIMA was chosen because it is especially effective at modeling trends and linear temporal patterns in data.

Keras, on the other hand, was chosen for its ability to deal with complex and non-linear relationships, mainly between different characteristics or variables in the data set.

In other words, the objective is the same, but using different paths. In this way, assessing an “agreement” between the predictions of both models. Using ARIMA and Keras combined, they make it possible to infer a mutual (de)validation of predictions.

Tools

To display the libraries and tools used, click the button:

R Libraries:

library(readxl)     # xls to df
library(dplyr)      # %>% e etc
library(plotly)     # graphic
library(foreign)   # dbf
library(reticulate) # python
library(keras)      # ml 
library(tensorflow) # ml
library(forecast)   # ml
library(writexl)    # export

set.seed(0)

Database

To display details about the Data Frame click on the button:

df_raw <- read_excel("IBGE_1991_2000_2010.xlsx", sheet = 1)
df_raw

The variables are, respectively…:

  • CODMUN: Municipality Code.
  • YEAR: Indexation 1991, 2000 and 2010.
  • NOMEMUN: Name of the Municipality.
  • UF: State.
  • REGION: Region.
  • AREA_KM2: Territorial extension in km².
  • KM_DIST_CAP: Distance from the capital in km.
  • ESPVIDA: (Esperança de vida) Life expectancy.
  • FECTOT: Total fertility rate.
  • T_ENV: Aging rate.
  • GINI: Gini Index.
  • PIND: Economically fragile population.
  • RDPC: Per capita income.
  • PESO15: Population aged 15 and over.
  • PESOTOT: Total resident population.
  • DEM_DEMO: Demographic density.
  • TXDSEMP: Unemployment Rate.


1. Data preparation

1.1. Subdividing by year.

# Separando por anos
df_1991 <- data.frame(df_raw[df_raw$ANO == 1991,])
df_2000 <- data.frame(df_raw[df_raw$ANO == 2000,])
df_2010 <- data.frame(df_raw[df_raw$ANO == 2010,])

Checking if the data frames are equivalent:

# Checando se os dataframes têm entradas idênticas (Em particular os municípios).
## Checando se colunas têm o mesmo tamanho
if (all(length(df_1991$NOMEMUN) == length(df_2000$NOMEMUN), length(df_1991$NOMEMUN) == length(df_2010$NOMEMUN))) {
## Checando se colunas têm o mesmo valor (municípios)
  if (identical(sort(df_1991$NOMEMUN), sort(df_2000$NOMEMUN)) &&
      identical(sort(df_2000$NOMEMUN), sort(df_2010$NOMEMUN)))
    {cat("Valores iguais.")
  } else {cat("Alguns valores diferentes.")
  }} else {cat("Tamanhos diferentes")}
## Valores iguais.
if (identical(df_1991$NOMEMUN, df_2000$NOMEMUN) && identical(df_2000$NOMEMUN, df_2010$NOMEMUN)) {cat("Nomes e posições exatamente iguais.")}
## Nomes e posições exatamente iguais.

Deleting currently irrelevant data.

# Selecionando colunas de interesse e isolando (Por index do município)
df_1991_ByMdx <- data.frame(df_1991[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])
df_2000_ByMdx <- data.frame(df_2000[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])
df_2010_ByMdx <- data.frame(df_2010[,c(16,6,7,8,9,10,11,12,13,14,15,17,18)])

1.2. Preparing Keras

consolidado <- data.frame(rbind(df_1991_ByMdx,
                                df_2000_ByMdx,
                                df_2010_ByMdx))

# Atribuindo para objeto sem Index do Município
df_keras <- data.frame(consolidado[,-1])
df_ranked <- data.frame(df_keras[,1:7])
df_keras

1.3. Correlations

Preliminary correlation analysis in the years 1991, 2000 and 2010:

# Selecionando variáveis
cor_consolidado <- cor(consolidado[, c("AREA_KM2","KM_DIST_CAP",
                                       "ESPVIDA","FECTOT","T_ENV","GINI",
                                       "PIND","RDPC","PESO15","PESOTOT",
                                       "DEM_DEMO","TXDSEMP")])
# Arredondando resultado
cor_consolidado <- round(cor_consolidado,2)
# Heatmap
G_HM_cor_cons <- plot_ly(
  type = "heatmap",
  colorscale = "Portland",
  z = cor_consolidado,
  x = colnames(cor_consolidado),
  y = rownames(cor_consolidado),
  zmin = -1,  
  zmax = 1,
  reversescale = TRUE
  ) %>% layout(
    title = "Matriz de Correlação",
    font = list(color = '#FFFFFF'),
    paper_bgcolor = "#222222",
    showlegend = FALSE
  )

# Exibindo anotações no gráfico onde valor diferente de 1.
for (nr in 1:nrow(cor_consolidado)) {
  for (nc in 1:ncol(cor_consolidado)) {
    if (cor_consolidado[nr, nc] != 1) {
      G_HM_cor_cons <- G_HM_cor_cons %>%
        add_annotations(
          text = round(cor_consolidado[nr, nc], 6),
          x = colnames(cor_consolidado)[nc],
          y = rownames(cor_consolidado)[nr],
          showarrow = FALSE,
          font = list(size = 14, color = "white"))}}}

G_HM_cor_cons

Although the data above is rich in socioeconomic information, in this study RDPC, PESOTOT and DEM_DEMO were selected.


2. ARIMA

The code below selects each municipality individually in the years 1991, 2000 and 2010. And based on this linear chronological organization, it projects data for 2020.

# Indexando ano
df_dec <- cbind(df_raw[2], df_keras)
# Removendo Areas
arima_df <- df_dec[, c(1,4,5,6,7,8,9,10,11,12,13)]
# Separando as décadas
arima_1991 <- data.frame(arima_df[arima_df$ANO == 1991,])
arima_2000 <- data.frame(arima_df[arima_df$ANO == 2000,])
arima_2010 <- data.frame(arima_df[arima_df$ANO == 2010,]) 

# Reset para debug
arima_2020 <- arima_2010[1,]                        # Adiciona linha e header
arima_2020 <- arima_2020[-nrow(arima_2020),]        # Remove linha e mantém header
A_arima_2020 <- arima_2010[1,]                      # Adiciona linha e header
A_arima_2020 <- A_arima_2020[-nrow(A_arima_2020),]  # Remove linha e mantém header

# Para cada elemento em cada linha do df_1991_1:
for (k in seq_len(nrow(arima_1991))) {
  # Seleciona linhas
  df_temp_forecast <- arima_1991[k,]
  df_temp_forecast <- rbind(df_temp_forecast, arima_2000[k,])
  df_temp_forecast <- rbind(df_temp_forecast, arima_2010[k,])
  arima_2020 <- arima_2010[k,] # Apenas pela estrutura, será sobrescrito. 
  arima_2020$ANO[1] <- 2020
  # Colunas para iteração
  cols_var <- c("ESPVIDA","FECTOT", "T_ENV","GINI","PIND",
                "RDPC","PESO15","PESOTOT","DEM_DEMO","TXDSEMP")
  for (i in cols_var) {
    arima_2020[[i]][1] <- mean(df_temp_forecast[[i]]) # 0 ou media
    temp_col <- df_temp_forecast[[i]]    # seleção de coluna predição
    try({ temp_line <- arima(temp_col, order = c(0,0,1))},silent = TRUE)
    forecast_2020 <- forecast(temp_line, h = 1)     # h = 1 (2020)
    arima_2020[[i]][1] <- forecast_2020$mean[1]} # Resultado a se guardar
  # Joga resultado para df_2020_1
  A_arima_2020 <- rbind(A_arima_2020, arima_2020[1, ])
  arima_2020 <- arima_2020[-nrow(arima_2020),]
  # Debug para acompanhamento
  if (k %% 500 == 0) {print(k)}
  else if (k == 5565) {print("Concluído")}}
## [1] 500
## [1] 1000
## [1] 1500
## [1] 2000
## [1] 2500
## [1] 3000
## [1] 3500
## [1] 4000
## [1] 4500
## [1] 5000
## [1] 5500
## [1] "Concluído"
# Exportando para csv
write.csv(A_arima_2020, "Arima_IBGE_2020", row.names = FALSE)
write.csv(A_arima_2020, "Arima_IBGE_2020_RowName", row.names = TRUE)

With ARIMA’s projections for a 2020 IBGE, the values are consolidated so that they can then be applied by Keras.

Arima_Bkp <- read.csv("Arima_IBGE_2020")
# Checando NA's
any(is.na(Arima_Bkp))
## [1] FALSE
Arima_Prdct <- data.frame(Arima_Bkp)

# Padronizando data frames
# Inserindo variáveis
Arima_Prdct$AREA_KM2 <- df_keras$AREA_KM2[1:5565]
Arima_Prdct$KM_DIST_CAP <- df_keras$KM_DIST_CAP[1:5565]
# Reordena e não seleciona $ANO de 2020
Arima_Prdct <- Arima_Prdct %>%
  select(c(
    "AREA_KM2", "KM_DIST_CAP","ESPVIDA", "FECTOT", "T_ENV",
    "GINI","PIND","RDPC","PESO15","PESOTOT","DEM_DEMO","TXDSEMP"
  ))


3. KERAS

Having projected data for 2020 with ARIMA (using the years 1991, 2000 and 2010). Now a model has been trained for later application in 2020. That is:

  • Keras initially learned relationships and correlations between variables, having RPDC, PESOTOT and DEM_DEMO as dependent variables.
  • In a second moment, this model learned was applied to the data projected for 2020 by ARIMA.

Click above to select which one you want to view.

3.1. RDPC

3.1.1. Training/Validation (RDPC)

# Selecionando target
x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
                  "GINI", "PIND","PESO15","PESOTOT","DEM_DEMO","TXDSEMP")]
y <- df_keras$RDPC

# Normalização dos dados
x <- scale(x)

# Definindo o modelo DNN
model_keras <- keras_model_sequential()
model_keras %>%
  layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1)

# Compilando modelo
model_keras %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = 'mse', 
  metrics = c('mae')
)

# Treinamento
history <- model_keras %>% fit(
  x, y,
  epochs = 100, 
  validation_split = 0.2, # Split treinamento/validação
  verbose = 0 # Ocultar Epoch's
)

# Desempenho do modelo
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 3907.9136 - mae: 34.8890 - 562ms/epoch - 1ms/step
print(paste("Mean Absolute Error (MAE):", metrics[2]))
## [1] "Mean Absolute Error (MAE): 34.888973236084"
print(paste("Mean Squared Error (MSE):", metrics[1]))
## [1] "Mean Squared Error (MSE): 3907.91357421875"
# Curva de aprendizado
plot(history)

metrics
##       loss        mae 
## 3907.91357   34.88897
# Análise Comparativa
# Predições realizadas (Para ranking)
predicoes_RDPC <- model_keras %>% predict(x)
## 522/522 - 1s - 556ms/epoch - 1ms/step
# Df pred/real
comparativo_RDPC <- data.frame(Previsoes = predicoes_RDPC, Real = y)

# Dispersão
plot(comparativo_RDPC$Real, comparativo_RDPC$Previsoes,
     xlab = c("RDPC Real", metrics),
     ylab = "Previsões do Modelo",
     main = "Comparação entre Previsões e Valores Reais")

# Keras RDPC (Ranking dos Municípios)
df_ranked$Pred_RDPC_Rnk <- predicoes_RDPC
df_ranked$PESO15 <- df_keras$PESO15
summary(predicoes_RDPC)
##        V1         
##  Min.   :  36.38  
##  1st Qu.: 182.19  
##  Median : 299.39  
##  Mean   : 340.94  
##  3rd Qu.: 469.31  
##  Max.   :1604.38

3.1.2. Applying to ARIMA 2020 (RDPC)

# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
                                 "FECTOT", "T_ENV", "GINI","PIND","PESO15",
                                 "PESOTOT","DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)

# Previsões
predicoes_RDPC_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)
## 174/174 - 0s - 170ms/epoch - 974us/step
# Anexar previsões ao data frame Arima_Prdct
Arima_Prdct$Pred_RDPC <- predicoes_RDPC_Arima_Prdct

3.2. PESOTOT

3.2.1. Training/Validation (PESOTOT)

# Selecionando target
x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
                  "GINI", "PIND","RDPC","PESO15","DEM_DEMO","TXDSEMP")]
y <- df_keras$PESOTOT

# Normalização dos dados
x <- scale(x)

# Definindo o modelo DNN
model_keras <- keras_model_sequential()
model_keras %>%
  layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1)

# Compilando modelo
model_keras %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = 'mse', 
  metrics = c('mae')
)

# Treinamento
history <- model_keras %>% fit(
  x, y,
  epochs = 100, 
  validation_split = 0.2, # Split treinamento/validação
  verbose = 0
)

# Desempenho do modelo
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 217377904.0000 - mae: 4604.7192 - 563ms/epoch - 1ms/step
print(paste("Mean Absolute Error (MAE):", metrics[2]))
## [1] "Mean Absolute Error (MAE): 4604.71923828125"
print(paste("Mean Squared Error (MSE):", metrics[1]))
## [1] "Mean Squared Error (MSE): 217377904"
# Curva de aprendizado
plot(history)

metrics
##         loss          mae 
## 2.173779e+08 4.604719e+03
# Análise Comparativa
# Predições realizadas (Para ranking)
predicoes_PESOTOT <- model_keras %>% predict(x)
## 522/522 - 1s - 526ms/epoch - 1ms/step
# Df pred/real
comparativo_PESOTOT <- data.frame(Previsoes = predicoes_PESOTOT, Real = y)

# Dispersão
plot(comparativo_PESOTOT$Real, comparativo_PESOTOT$Previsoes,
     xlab = c("RDPC Real", metrics),
     ylab = "Previsões do Modelo",
     main = "Comparação entre Previsões e Valores Reais")

# Keras RDPC (Ranking dos Municípios)
df_ranked$Pred_PESOTOT_Rnk <- predicoes_PESOTOT
summary(predicoes_PESOTOT)
##        V1          
##  Min.   :    1845  
##  1st Qu.:    8557  
##  Median :   11977  
##  Mean   :   31326  
##  3rd Qu.:   19560  
##  Max.   :12438205

3.2.2. Applying to ARIMA 2020 (PESOTOT)

# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
                                 "FECTOT", "T_ENV", "GINI","PIND",
                                 "RDPC","PESO15", "DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)

# Previsões
predicoes_PESOTOT_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)
## 174/174 - 0s - 220ms/epoch - 1ms/step
# Anexar previsões ao data frame Arima_Prdct
Arima_Prdct$Pred_PESOTOT <- predicoes_PESOTOT_Arima_Prdct

3.3. DEM_DEMO

3.3.1. Training/Validation (DEM_DEMO)

x <- df_keras[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA", "FECTOT", "T_ENV",
                  "GINI", "PIND","RDPC","PESO15","PESOTOT","TXDSEMP")]
y <- df_keras$DEM_DEMO
x <- scale(x)
model_keras <- keras_model_sequential()
model_keras %>%
  layer_dense(units = 64, activation = 'relu', input_shape = dim(x)[2]) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1)
model_keras %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = 'mse', 
  metrics = c('mae')
)
history <- model_keras %>% fit(
  x, y,
  epochs = 100, 
  validation_split = 0.2, # Split treinamento/validação
  verbose = 0
)
metrics <- model_keras %>% evaluate(x, y)
## 522/522 - 1s - loss: 117868.3281 - mae: 67.4136 - 824ms/epoch - 2ms/step
print(paste("Mean Absolute Error (MAE):", metrics[2]))
## [1] "Mean Absolute Error (MAE): 67.4136276245117"
print(paste("Mean Squared Error (MSE):", metrics[1]))
## [1] "Mean Squared Error (MSE): 117868.328125"
plot(history)

metrics
##         loss          mae 
## 117868.32812     67.41363
predicoes_DEM_DEMO <- model_keras %>% predict(x)
## 522/522 - 1s - 648ms/epoch - 1ms/step
comparativo_DEM_DEMO <- data.frame(Previsoes = predicoes_DEM_DEMO, Real = y)
plot(comparativo_DEM_DEMO$Real, comparativo_DEM_DEMO$Previsoes,
     xlab = c("RDPC Real", metrics),
     ylab = "Previsões do Modelo",
     main = "Comparação entre Previsões e Valores Reais")

# Keras RDPC (Ranking dos Municípios)
df_ranked$Pred_DEM_DEMO_Rnk <- predicoes_DEM_DEMO
df_ranked$TXDSEMP <- df_keras$TXDSEMP
summary(predicoes_DEM_DEMO)
##        V1           
##  Min.   : -341.524  
##  1st Qu.:    9.402  
##  Median :   26.185  
##  Mean   :   97.150  
##  3rd Qu.:   65.372  
##  Max.   :15828.087

3.3.2. Applying to ARIMA 2020 (DEM_DEMO)

# Setup
x_Arima_Prdct <- Arima_Prdct[, c("AREA_KM2", "KM_DIST_CAP", "ESPVIDA",
                                 "FECTOT", "T_ENV", "GINI","PIND","PESO15",
                                 "PESOTOT","DEM_DEMO","TXDSEMP")]
x_Arima_Prdct <- scale(x_Arima_Prdct)

# Previsões
predicoes_DEM_DEMO_Arima_Prdct <- model_keras %>% predict(x_Arima_Prdct)
## 174/174 - 0s - 212ms/epoch - 1ms/step
# Anexar previsões ao data frame Arima_Prdct
Arima_Prdct$Pred_DEM_DEMO <- predicoes_DEM_DEMO_Arima_Prdct


4. Predictions and Filters

4.1. ARIMA vs Keras

Comparison of predicted by ARIMA and Keras for 2020:

Click above to select which one you want to view.

ARIMA and Keras 2020

Organizing Data Frames:

# Checando dados faltantes
any(is.na(Arima_Prdct))
## [1] FALSE
# Predições de ARIMA para 2020.
A_ARIMA_20 <- round(Arima_Prdct[,c(8,10,11)],2)
colnames(A_ARIMA_20)[1] <- "RDPC"
colnames(A_ARIMA_20)[2] <- "PESOTOT"
colnames(A_ARIMA_20)[3] <- "DEM_DEMO"

# Predições de Keras para 2020.
A_KERAS_20 <- round(Arima_Prdct[, c(13,14,15)],2)
colnames(A_KERAS_20)[1] <- "RDPC"
colnames(A_KERAS_20)[2] <- "PESOTOT"
colnames(A_KERAS_20)[3] <- "DEM_DEMO"

ARIMA 2020

A_ARIMA_20

Keras 2020

A_KERAS_20

ARIMA/Keras Variation

When “1” the variation expressed is exact between models.

Ar_Vs_Kr <- round(A_ARIMA_20/A_KERAS_20,2)
Ar_Vs_Kr

4.2. Graphic Analysis

Data analysis and filter selection

The filters selected those municipalities where the predictions of both models agree at 20% variation. This is between 0.8 and 1.2, with 1 being perfect agreement. In this topic, the sample of candidate municipalities is expressed graphically, where the two red lines represent the considered range of the applied filter.

Click above to select which one you want to view.

Per Capita Income

hist(Ar_Vs_Kr$RDPC,
     breaks = 200,
     xlim = c(0, 2),
     ylab = "Frequência",
     xlab = "Variação",
     main = "Histograma de renda per capita em 2020: Arima VS Keras")
abline(v = 0.8, col = "red", lty = 2)
abline(v = 1.2, col = "red", lty = 2) 

Total Population

hist(Ar_Vs_Kr$PESOTOT,
     breaks = 100,
     xlim = c(0, 2),
     ylab = "Frequência",
     xlab = "Variação",
     main = "Histograma de população total em 2020: Arima VS Keras")
abline(v = 0.8, col = "red", lty = 2)
abline(v = 1.2, col = "red", lty = 2) 

Demographic Density

hist(Ar_Vs_Kr$DEM_DEMO,
     breaks = 100000,
     xlim = c(0, 2),
     ylab = "Frequência",
     xlab = "Variação",
     main = "Histograma de densidade demográfica em 2020: Arima VS Keras")
abline(v = 0.8, col = "red", lty = 2)
abline(v = 1.2, col = "red", lty = 2) 

4.3. Filters

Consolidating Data Frames, it contains:

  • Identification of the municipality (CODMUN)
  • Name of the municipality (NOMEMUN)
  • Federative Unit (UF)
  • Region (REGIAO)
  • Filter (Ar_vs_Kr): RDPC, PESOTOT, DEM_DEMO
  • Consistency Profile: Average of predictions (AxK_Mean) R,P,D
# RENOMEANDO KERAS
An_KERAS_20 <- data.frame(A_KERAS_20)
An_ARIMA_20 <- data.frame(A_ARIMA_20)
An_Mean_20 <- (An_ARIMA_20 + An_KERAS_20) / 2
colnames(An_Mean_20)[1] <- "AxK_Mean_R"
colnames(An_Mean_20)[2] <- "AxK_Mean_P"
colnames(An_Mean_20)[3] <- "AxK_Mean_D"
An_Mean_20 <- round(An_Mean_20,2)

## FILTRO
# Identificação
decisao <-  df_2010[, c(1,3,4,5)] 

# Ar_Vs_Kr: Filtro
# AxK_Mean: Estável
decisao <- cbind(decisao, Ar_Vs_Kr)
decisao <- cbind(decisao, An_Mean_20)

4.3.1. Filters

Filter: selects the municipalities where the agreement of predictions between ARIMA and Keras is 20% different. Or 0.8 and 1.2, with 1 being the exact prediction agreement.

Click above to select which one you want to view.

# Considerar apenas os dados que estão entre o primeiro e terceiro quartil
paste("There are",nrow(decisao),"potential municipalities. The filters are the 'safety margins' where there is agreement between the predictions of the two models. These are municipalities that are at 0.8 and 1.2 'in common' prediction between the ARIMA and Keras models. In other words, the 'municipalities that had a similar forecast' (1 being exact, and considering a variation of 20% above or below).")
## [1] "There are 5565 potential municipalities. The filters are the 'safety margins' where there is agreement between the predictions of the two models. These are municipalities that are at 0.8 and 1.2 'in common' prediction between the ARIMA and Keras models. In other words, the 'municipalities that had a similar forecast' (1 being exact, and considering a variation of 20% above or below)."

Filter 1

filtro1 <- subset(decisao, RDPC >= 0.8 & RDPC <= 1.2)
paste("Applying the first filter, remains:",nrow(filtro1),"potential places.",nrow(decisao) - nrow(filtro1),"municipalities removed.")
## [1] "Applying the first filter, remains: 3417 potential places. 2148 municipalities removed."

Filter 2

filtro2 <- subset(filtro1, PESOTOT >= 0.8 & PESOTOT <= 1.2)
paste("Applying the first filter, remains:",nrow(filtro2),"potential places.",nrow(filtro1) - nrow(filtro2),"municipalities removed.")
## [1] "Applying the first filter, remains: 1397 potential places. 2020 municipalities removed."

Filter 3

filtro3 <- subset(filtro2, DEM_DEMO >= 0.8 & DEM_DEMO <= 1.2)
paste("Applying the first filter, remains:",nrow(filtro3),"potential places.",nrow(filtro2) - nrow(filtro3),"municipalities removed.")
## [1] "Applying the first filter, remains: 222 potential places. 1175 municipalities removed."


5. Results

When applying the filters, from 5565 candidates we are left with 222 municipalities in which Keras and ARIMA agree with a margin of 20%.

From these results, two profiles were hypothesized as suggestions:

Click above to select which one you want to view.

resultado <- filtro3
perfil_1 <- resultado[, c(2,3,4,8,9,10)]
resultado

Top 3 Profile 1: Per Capita Income

For retailers that have a sales profile with a high profit margin (taking into account a population with high purchasing power).

perfil_1[order(-perfil_1$AxK_Mean_R),][1:3,]

Top 3 Profile 2: Demographic density

For retailers that have a sales profile with a low profit margin, but with a large quantity of sales (given the high demographic density).

perfil_1[order(-perfil_1$AxK_Mean_D),][1:3,]


6. Conclusion

The exposed approach was based on the premise that certain factors are decisive in the success of a retail venture. While others simply have little or no relevance.

The study carried out did not seek to prove the previous statement. But imagine a scenario where this premise is real. In this way, making it possible to measure variables and establish better markers for the success of this business. And, finally, an indication of a location for future studies to identify promising municipalities for opening a new retail business.

The data considered relevant were: territorial extension, distance from the capital, life expectancy, total fertility rate, aging rate, Gini index, economic vulnerability population index, population aged 15 or over and unemployment rate. In this context, the following variables were emphasized: per capita income, total resident population and demographic density.

The analysis was carried out using models that originate from different points to make different assumptions about the same goal. Thus, aiming to provide more robust predictions. Implying, in a way, an “agreement” between what was predicted by the models.

Later, filters were created as “safety margins” for this “agreement” of the models. Which, in other words, means “the municipalities that had related forecasts”. The safety margins were 20%. The first filter is per capita income, the second total resident population and the third demographic density.

Starting from 5565 potential municipalities. Applying the first filter left 3417 candidates. 1397 in the second filter. Finally, leaving 222 potential municipalities.

It was concluded that it was possible to preliminarily suggest two profiles: One seeking few sales with a high profit margin; and another seeking high recurrence of sales. The three selected in each profile were:

1:

perfil_1[order(-perfil_1$AxK_Mean_R),][1:3,]

… for high-income profile.

2:

perfil_1[order(-perfil_1$AxK_Mean_D),][1:3,]

… for high “turnover” profiles.